
feat(lumera): add LEP-6 chain client extensions #286

Open
j-rafique wants to merge 8 commits into master from supernode/LEP-6-chain-client-extensions

Conversation

@j-rafique
Contributor

No description provided.

@j-rafique j-rafique self-assigned this Apr 30, 2026
@j-rafique j-rafique force-pushed the supernode/LEP-6-chain-client-extensions branch from fba3844 to 6d3160b Compare May 4, 2026 11:39
@j-rafique j-rafique marked this pull request as ready for review May 4, 2026 11:46
@roomote-v0

roomote-v0 Bot commented May 4, 2026


The new LEP-6 chain-driven self-healing runtime (dispatcher, healer, verifier, finalizer, transport handler, SQLite dedup, cascade staging/publish, peer client) is well-structured, thoroughly tested, and cleanly integrated. The previous concurrency concern has been addressed with a sync.RWMutex guard. No new issues found.

Previous reviews

  • SetProofResultProvider writes proofResultProvider without synchronization while tick() reads it concurrently -- consider an atomic pointer or mutex guard

Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues.

@j-rafique j-rafique force-pushed the supernode/LEP-6-chain-client-extensions branch from 6d3160b to 38849ea Compare May 4, 2026 12:03
@j-rafique j-rafique force-pushed the supernode/LEP-6-chain-client-extensions branch from 38849ea to fd7466f Compare May 4, 2026 12:03
roomote-v0[bot] previously approved these changes May 4, 2026
Introduces pkg/storagechallenge/deterministic/lep6.go, the off-chain
computation library shared by the storage_challenge runtime, recheck
service, and self-healing dispatcher. Every function is pure (no I/O,
no clock, no goroutines) so independent reporters challenging the same
(target, ticket) pair produce byte-identical StorageProofResult fields.

Functions land in two categories:

CHAIN-MIRRORED (must match lumera/x/audit/v1/keeper/audit_peer_assignment.go
byte-for-byte; the chain re-runs them to validate MsgSubmitEpochReport):

  - SelectLEP6Targets — 1/3 deterministic target subset
    (SHA-256(seed||0x00||account||0x00||"challenge_target"),
    targetCount = ceil(N/divisor) clamped to [1, N])
  - PairChallengerToTarget / AssignChallengerTargets — challenger->target
    pairing (label "pair"), with no-self and lex tie-break

SUPERNODE-CANONICAL (chain stores outputs as opaque strings; this file
defines the canonical encoding all reporters must use to stay in lockstep):

  - ClassifyTicketBucket — RECENT/OLD bucket classification using
    Action.BlockHeight (Action.UpdatedHeight does not exist; see
    docs/plans/LEP6_SUPERNODE_IMPLEMENTATION_PLAN.md "Resolved Decision 3")
  - SelectTicketForBucket — deterministic per-(target,bucket) ticket pick
    with excluded-set support for active heal ops
  - SelectArtifactClass — LEP-6 §10 weighted roll (20% INDEX / 80% SYMBOL)
    with deterministic fallback when a class has no artifacts
  - SelectArtifactOrdinal — uniform ordinal mod artifactCount
  - ComputeMultiRangeOffsets — k=4 range offsets in [0, size-rangeLen)
  - ComputeCompoundChallengeHash — BLAKE3 over concat of slices in offset
    order (lukechampine.com/blake3 to match the chain's library)
  - DerivationInputHash — canonical hex of derivation inputs
  - TranscriptHash — full canonical transcript identifier with sorted
    observer ids; struct-input form prevents field-order mistakes
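
A minimal sketch of the seeded-ranking shape SelectLEP6Targets uses
(illustrative only: the sort direction, tie-break details, and exact byte
layout are authoritative in audit_peer_assignment.go / lep6.go, and the names
below are stand-ins):

package deterministicsketch

import (
    "crypto/sha256"
    "sort"
)

// rankHash computes SHA-256(seed || 0x00 || account || 0x00 || label).
func rankHash(seed []byte, account, label string) [32]byte {
    h := sha256.New()
    h.Write(seed)
    h.Write([]byte{0x00})
    h.Write([]byte(account))
    h.Write([]byte{0x00})
    h.Write([]byte(label))
    var out [32]byte
    copy(out[:], h.Sum(nil))
    return out
}

// selectTargets keeps ceil(N/divisor) accounts, clamped to [1, N], ranked by
// their per-seed hash. Ascending hash order and the lexicographic tie-break
// are assumptions of this sketch.
func selectTargets(seed []byte, active []string, divisor int) []string {
    if divisor < 1 {
        divisor = 1
    }
    n := len(active)
    count := (n + divisor - 1) / divisor
    if count < 1 {
        count = 1
    }
    if count > n {
        count = n
    }
    ranked := append([]string(nil), active...)
    sort.Slice(ranked, func(i, j int) bool {
        hi := rankHash(seed, ranked[i], "challenge_target")
        hj := rankHash(seed, ranked[j], "challenge_target")
        if hi == hj {
            return ranked[i] < ranked[j]
        }
        return string(hi[:]) < string(hj[:])
    })
    return ranked[:count]
}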

Domain separators ("challenge_target", "pair", "ticket_rank",
"artifact_class", "artifact_ordinal", "range_offset",
"derivation_input", "transcript") and enum string forms ("INDEX"/
"SYMBOL", "RECENT"/"OLD"/"PROBATION"/"RECHECK") are package
constants; freezing them prevents accidental drift between callers and
tests. Any change is a protocol-level break that requires versioning.

Tests:
  - TestStorageTruthAssignmentHash_KnownVector locks the byte-level SHA-256
    composition against an independent computation, guaranteeing the
    chain-mirrored helper has not drifted.
  - TestSelectLEP6Targets_OneThirdCoverage_AssignmentMatchesChain uses the
    chain's own audit_peer_assignment_test.go fixture
    (seed="01234567890123456789012345678901", active={sn-a..sn-f},
    divisor=3) — output {sn-f, sn-e}.
  - TestAssignChallengerTargets_KnownAssignment locks the full pairing
    {sn-a -> sn-f, sn-b -> sn-e}.
  - TestSelectArtifactClass_WeightedDistribution validates ~20% INDEX
    over 5000 draws (±2% tolerance).
  - Determinism, sensitivity, error-path, and out-of-bounds tests for
    every primitive.

Verified: `go test ./pkg/storagechallenge/deterministic/...` passes; the
existing deterministic_test.go pre-LEP-6 tests continue to pass.
roomote-v0[bot] previously approved these changes May 4, 2026
…#288)

Implements the PR3 compound storage challenge runtime on top of the latest
LEP-6 PR1 foundation branch after PR2 was merged into it.

Highlights:
- Add compound proof request/response fields and regenerated supernode proto bindings.
- Add recipient-side GetCompoundProof handler with signed responses and range validation.
- Add challenger-side LEP6Dispatcher for assigned-target dispatch across RECENT/OLD buckets.
- Add result buffer implementing host_reporter ProofResultProvider with deterministic chain-cap throttling.
- Add deterministic cascade metadata resolution helpers for artifact count, key, and exact artifact size.
- Add production ChainTicketProvider backed by final Lumera x/action ListActionsBySuperNode query.
- Wire startup to use ChainTicketProvider and cascade metadata/action size resolution instead of NoTicketProvider.
- Classify target RPC timeout/no-response as TIMEOUT_OR_NO_RESPONSE and malformed transcripts as INVALID_TRANSCRIPT.
- Extend action module bindings/mocks with ListActionsBySuperNode.
- Preserve PR1 provider concurrency hardening and PR2 deterministic roocode fixes after rebase.

Lumera dependency/source:
- github.com/LumeraProtocol/lumera v1.12.0
- chain source: lumera/master 451f8a8e7ff30b3370cba59fab8e6228473a348b

Validation:
- git diff --check origin/supernode/LEP-6-chain-client-extensions..HEAD: pass
- go test ./pkg/storagechallenge/... ./supernode/storage_challenge ./supernode/transport/grpc/storage_challenge ./supernode/host_reporter ./pkg/lumera/modules/action ./pkg/lumera/modules/audit ./pkg/lumera/modules/audit_msg -count=1 -v: pass
- go vet ./pkg/storagechallenge/... ./supernode/storage_challenge ./supernode/transport/grpc/storage_challenge ./supernode/host_reporter ./pkg/lumera/modules/action ./pkg/lumera/modules/audit ./pkg/lumera/modules/audit_msg: pass
- go test ./... -count=1: pass

Plan: docs/plans/LEP6_SUPERNODE_IMPLEMENTATION_PLAN_v3_MASTER.md PR3
roomote-v0[bot] previously approved these changes May 4, 2026
…289)

Replaces the gonode-era peer-watchlist self-healing with a chain-mediated
LEP-6 §18-§22 (Workstream C) implementation. Healer reconstructs locally
and STAGES (no KAD publish), verifiers fetch reconstructed bytes from the
assigned healer over a streaming gRPC RPC (§19 healer-served path) and
hash-compare against op.ResultHash, then publish to KAD only after chain
VERIFIED quorum.

Three-phase flow

  Phase 1 — RECONSTRUCT (no publish)
    cascade.RecoveryReseed(PersistArtifacts=false, StagingDir) →
    download remaining symbols → RaptorQ-decode → verify file hash
    against Action.DataHash → re-encode → stage symbols+idFiles+layout
    +reconstructed.bin to ~/.supernode/heal-staging/<op_id>/. Submit
    MsgClaimHealComplete{HealManifestHash}; chain transitions
    SCHEDULED → HEALER_REPORTED, sets op.ResultHash = HealManifestHash.

  Phase 2 — VERIFY (§19 healer-served path)
    Verifier opens supernode.SelfHealingService/ServeReconstructedArtefacts
    on the assigned healer (op.HealerSupernodeAccount), streams the
    reconstructed bytes, computes BLAKE3 base64 (=Action.DataHash recipe
    via cascadekit.ComputeBlake3DataHashB64), compares against
    op.ResultHash (NOT Action.DataHash — chain enforces at
    lumera/x/audit/v1/keeper/msg_storage_truth.go:291), and submits
    MsgSubmitHealVerification{verified, hash}. Chain quorum n/2+1.

  Phase 3 — PUBLISH (only on VERIFIED)
    Finalizer polls heal_claims_submitted (Opt 2b per-op poll, folded
    into single tick loop alongside healer + verifier dispatch), reads
    op.Status, calls cascade.PublishStagedArtefacts on VERIFIED (same
    storeArtefacts path as register/upload), deletes staging on
    FAILED/EXPIRED. Chain may reschedule a different healer on
    EXPIRED.

Crash-recovery / restart-safety

  Submit-then-persist ordering: SQLite dedup row is written ONLY after
  chain has accepted the tx. A failed submit (mempool, signing, chain
  reject) leaves no row and staging is removed, so the next tick can
  retry cleanly. If chain accepted a prior submit but the supernode
  crashed before persisting, the next tick's resubmit fails with "does
  not accept healer completion claim" and reconcileExistingClaim
  re-fetches the heal-op, confirms chain ResultHash equals our manifest,
  and persists the dedup row so finalizer takes over.

  Negative-attestation hash: chain rejects empty VerificationHash even
  on verified=false (msg_storage_truth.go:271-273). Verifier synthesizes
  a deterministic non-empty placeholder
  (sha256("lep6:negative-attestation:"+reason) base64) on fetch_failed
  and hash_compute_failed paths. Chain only validates VerificationHash
  content for positive votes (msg_storage_truth.go:288-294), so any
  non-empty value is well-formed for negatives.
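
  For reference, the placeholder recipe is small enough to sketch inline
  (standard base64 encoding assumed; the canonical reason strings and the
  actual wiring live in verifier.go):

package main

import (
    "crypto/sha256"
    "encoding/base64"
    "fmt"
)

// negativeAttestationHash derives a deterministic, non-empty VerificationHash
// for a negative vote from a short canonical reason string.
func negativeAttestationHash(reason string) string {
    sum := sha256.Sum256([]byte("lep6:negative-attestation:" + reason))
    return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
    fmt.Println(negativeAttestationHash("fetch_failed"))
    fmt.Println(negativeAttestationHash("hash_compute_failed"))
}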

Components added

  supernode/self_healing/
    service.go      Single tick loop; mode gate (UNSPECIFIED skips);
                    healer dispatch; verifier dispatch; finalizer poll;
                    sync.Map in-flight + buffered semaphores
                    (reconstructs=2, verifications=4, publishes=2).
    healer.go       Phase 1: submit-then-persist ordering;
                    reconcileExistingClaim handles post-crash recovery
                    when chain accepted a prior submit.
    verifier.go     Phase 2: fetch from assigned healer, retry with
                    exponential backoff (3 attempts), submit verified=
                    false with non-empty placeholder hash on persistent
                    fetch failure; positive-path hash compares against
                    op.ResultHash; reconciles chain-side
                    "verification already submitted" idempotency.
    finalizer.go    Phase 3: VERIFIED → publish + cleanup; FAILED/
                    EXPIRED → cleanup only; transient states no-op.
    peer_client.go  secureVerifierFetcher dials via the same
                    secure-rpc / lumeraid stack the legacy
                    storage_challenge loop uses.

  supernode/transport/grpc/self_healing/handler.go
    Streaming ServeReconstructedArtefacts RPC.
    DefaultCallerIdentityResolver pulls verifier identity from the
    secure-rpc (Lumera ALTS) handshake via
    pkg/reachability.GrpcRemoteIdentityAndAddr — production wiring uses
    this so req.VerifierAccount is never trusted alone. Authorizes
    caller ∈ op.VerifierSupernodeAccounts AND identity ==
    op.HealerSupernodeAccount; refuses with FailedPrecondition when
    not the assigned healer and PermissionDenied for unassigned callers.
    1 MiB chunks.

  proto/supernode/self_healing.proto
    SelfHealingService { ServeReconstructedArtefacts streams chunks }.
    Makefile gen-supernode wires it; gen/supernode/self_healing*.pb.go
    regenerated.

  supernode/cascade/reseed.go
    Split RecoveryReseed: PersistArtifacts=true (legacy/republish) vs
    PersistArtifacts=false (LEP-6 stage-only). Adds stageArtefacts +
    PublishStagedArtefacts. Stages reconstructed file bytes and a
    JSON manifest the §19 transport reads.
  supernode/cascade/staged.go
    ReadStagedHealOp helper used by the transport handler.
  supernode/cascade/interfaces.go
    CascadeTask interface gains RecoveryReseed + PublishStagedArtefacts
    so self_healing depends only on the factory abstraction.

  pkg/storage/queries/self_healing_lep6.go
    Tables heal_claims_submitted (PK heal_op_id) and
    heal_verifications_submitted (PK (heal_op_id, verifier_account))
    for restart dedup. Typed sentinel errors
    ErrLEP6ClaimAlreadyRecorded / ErrLEP6VerificationAlreadyRecorded.
    Migrations wired in OpenHistoryDB.
  pkg/storage/queries/local.go
    LocalStoreInterface embeds LEP6HealQueries.

  supernode/config/config.go
    SelfHealingConfig YAML block (enabled, poll_interval_ms,
    max_concurrent_*, staging_dir, verifier_fetch_timeout_ms,
    verifier_fetch_attempts). Default disabled until activation.

  supernode/cmd/start.go
    Constructs selfHealingService.Service + selfHealingRPC.Server
    (with DefaultCallerIdentityResolver) when SelfHealingConfig.Enabled,
    registers SelfHealingService_ServiceDesc on the gRPC server,
    appends the runner to the lifecycle services list. Reuses cService
    (cascade factory) and historyStore.

Tests (16 mandatory; all PASS)

  supernode/self_healing/service_test.go
    1.  TestVerifier_ReadsOpResultHashForComparison       (R-bug pin)
    2.  TestVerifier_HashMismatchProducesVerifiedFalse
    2b. TestVerifier_FetchFailureSubmitsNonEmptyHash      (BLOCKER pin)
    3.  TestVerifier_FetchesFromAssignedHealerOnly        (§19 gate)
    6.  TestHealer_FailedSubmitDoesNotPersistDedupRow     (ordering)
    6b. TestHealer_ReconcilesExistingChainClaimAfterCrash (recovery)
    7.  TestHealer_RaptorQReconstructionFailureSkipsClaim (Scenario C1)
    8.  TestFinalizer_VerifiedTriggersPublishToKAD        (Scenario A)
    9.  TestFinalizer_FailedSkipsPublish_DeletesStaging   (Scenario B)
    10. TestFinalizer_ExpiredSkipsPublish_DeletesStaging  (Scenario C2)
    11. TestService_NoRoleSkipsOp
    12. TestService_UnspecifiedModeSkipsEntirely          (mode gate)
    13. TestService_FinalStateOpsIgnored
    14. TestDedup_RestartDoesNotResubmit                  (3-layer dedup)
  supernode/transport/grpc/self_healing/handler_test.go
    4. TestServeReconstructedArtefacts_AuthorizesOnlyAssignedVerifiers
    5. TestServeReconstructedArtefacts_RejectsUnassignedCaller
       (also covers non-assigned-healer FailedPrecondition refusal)
  pkg/storage/queries/self_healing_lep6_test.go
    TestLEP6_HealClaim_RoundTripAndDedup
    TestLEP6_HealVerification_PerVerifierDedup

Validation

  go test ./supernode/self_healing/...                 PASS (2.66s)
  go test ./supernode/transport/grpc/self_healing/...  PASS (0.09s)
  go test ./supernode/cascade/...                      PASS (0.09s)
  go test ./pkg/storage/queries/...                    PASS (0.20s)
  go test ./pkg/storagechallenge/... ./supernode/storage_challenge \
          ./supernode/host_reporter ./pkg/lumera/modules/audit \
          ./pkg/lumera/modules/audit_msg                  PASS
  go vet (touched + all transitively reachable pkgs)      PASS
  go build (targeted)                                     PASS
  (full repo go build fails only on pre-existing
   github.com/kolesa-team/go-webp libwebp-dev system-header issue;
   unrelated to this change.)

Resolved decisions applied

  ✓ Branch base: PR-3 tip f79f88f, NOT self-healing-improvements
    (single chain-driven service per Bilal direction; legacy 3-way
    Request/Verify/Commit RPC discarded).
  ✓ Verifier compares against op.ResultHash (chain msg_storage_truth.go
    :291). Pinned by TestVerifier_ReadsOpResultHashForComparison.
  ✓ Hash recipe = cascadekit.ComputeBlake3DataHashB64 (=Action.DataHash
    recipe). Same recipe healer + verifier + chain enforce.
  ✓ KAD publish AFTER chain VERIFIED (§19 healer-served-path gate);
    staging directory is the only authority before quorum.
  ✓ Finalizer mechanism: Opt 2b (per-op GetHealOp poll, folded into
    single tick loop) — no Tendermint WS, no monotonic-growth poll.
  ✓ Concurrency default: semaphore=2 reconstructs (RaptorQ RAM-aware),
    4 verifications, 2 publishes.
  ✓ Mode gate: UNSPECIFIED skips dispatcher entirely (Service.tick
    early-return; verified by TestService_UnspecifiedModeSkipsEntirely).
  ✓ Three-layer dedup: sync.Map + bounded semaphores + SQLite
    (heal_claims_submitted + heal_verifications_submitted).
  ✓ Submit-then-persist ordering with reconcile path for crash recovery.
  ✓ Non-empty placeholder VerificationHash on negative attestations
    (chain rejects empty regardless of verified bool).
  ✓ Caller authentication via secure-rpc / Lumera ALTS handshake at
    transport layer; req.VerifierAccount never trusted alone in
    production.

Plan: docs/plans/LEP6_PR4_EXECUTION_PLAN.md
roomote-v0[bot] previously approved these changes May 4, 2026
Implements the PR-5 Supernode side of LEP-6 storage-truth recheck evidence on top of the PR-4 heal-op dispatch branch.

Public surfaces added:
- supernode/recheck: Candidate, RecheckResult, Finder, Attestor, Service, ReporterSource, SupernodeReporterSource, eligibility and outcome mapping helpers.
- pkg/storage/queries: RecheckQueries plus SQLite-backed HasRecheckSubmission and RecordRecheckSubmission.
- pkg/lumera/modules/audit: GetEpochReportsByReporter query wrapper for network-wide candidate discovery.
- supernode/storage_challenge: LEP6Dispatcher.Recheck to execute RECHECK-bucket proofs without adding results to epoch reports.

Spec/chain alignment decisions:
- Candidate discovery is network-wide: the service lists registered supernodes and scans EpochReportsByReporter over the configured lookback window, rather than only scanning this node's own report.
- Recheck candidate eligibility mirrors chain storage transcript records: only HASH_MISMATCH, TIMEOUT_OR_NO_RESPONSE, OBSERVER_QUORUM_FAIL, and INVALID_TRANSCRIPT originals are eligible.
- The service rejects self-target candidates and self-reported challenged results because chain SubmitStorageRecheckEvidence rejects creator == challenged_supernode_account and creator == challenged result reporter.
- Recheck execution maps local PASS to PASS and confirmed hash mismatch to RECHECK_CONFIRMED_FAIL; timeout/quorum/invalid transcript classes remain explicit and are not collapsed.
- Recheck execution reuses the PR-3 compound dispatcher in RECHECK bucket mode with an isolated temporary buffer so recheck results are submitted only through MsgSubmitStorageRecheckEvidence and are never included in host epoch reports.
- Local dedup is submit-then-persist keyed by epoch_id + ticket_id (creator/self is implicit locally); tx hard-fail does not persist, while chain replay/already-submitted errors persist local dedup for idempotence.
- Startup/config wiring is additive under storage_challenge.lep6.recheck and remains disabled unless explicitly enabled.

Tests added/updated:
- Eligibility matrix for all eligible and rejected result classes.
- Outcome mapping for PASS, RECHECK_CONFIRMED_FAIL, timeout, quorum, and invalid transcript.
- Finder lookback/order/limit/local-dedup behavior.
- Network-wide reporter discovery regression so peer-reported failures are discovered and not self-report-only.
- Self-target and self-reported candidate rejection pinned against chain validation.
- Service mode gate and submit path.
- Attestor submit-then-persist, tx hard-fail retry safety, idempotent already-submitted handling, and required-field rejection.
- SQLite recheck submission idempotence/dedup preservation.
- Dispatcher RECHECK execution path integration through focused package tests.

Validation:
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./supernode/host_reporter ./supernode/self_healing ./supernode/transport/grpc/self_healing ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go vet ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit ./supernode/host_reporter ./supernode/self_healing ./supernode/transport/grpc/self_healing => PASS
- git diff --check => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./... => expected local environment failure only in pkg/storage/files due to missing go-webp system headers webp/decode.h and webp/encode.h; other visible packages pass.

Parent: supernode/LEP-6-heal-op-dispatch @ 043fba4.
Collaborator

@mateeullahmalik mateeullahmalik left a comment


Production-Gate Review — PR #286 feat(lumera): add LEP-6 chain client extensions

Reviewed at head 96764d8. 108 files / +15.8k / −903. Three parallel deep passes across (a) self-healing, (b) storage-challenge dispatch + recheck, (c) chain-client / config / observability.

I'm flagging only real bugs / safety / consensus-or-fee-flow issues — no nits, no style.


🔴 CRITICAL

C1. applyLEP6DefaultsAndValidate silently auto-opts every operator into LEP-6 on upgrade
supernode/config/lep6.go:80-113

if !c.StorageChallengeConfig.LEP6.enabledSet { c.StorageChallengeConfig.LEP6.Enabled = true }
if !recheck.enabledSet { recheck.Enabled = true }
if !c.SelfHealingConfig.enabledSet { c.SelfHealingConfig.Enabled = true }

Existing operator configs predating this PR carry none of those YAML blocks → enabledSet=false for all three → on next start the LEP-6 dispatcher, recheck attestor, and self-healing healer/verifier/finalizer all flip ON. The chain-wide enforcement-mode gate is the only safety net, and that's a global, not an operator-local opt-in. Effect: unannounced gas/fee burn on every upgraded SN, plus surprise RAM-heavy RaptorQ reseeds on tight boxes. Default these to false when omitted, or gate behind a single explicit lep6.enabled knob.

C2. Local recheck dedup is keyed on (epoch, ticket) but chain dedup is (epoch, ticket, target) — legitimate rechecks dropped
pkg/storage/queries/recheck.go:33 (PRIMARY KEY (epoch_id, ticket_id)), :53 (HasRecheckSubmission), supernode/recheck/finder.go:98 (in-memory seen keyed epoch/ticket)
The schema has the target_account column but doesn't include it in the PK or any lookup. If a ticket has multiple challenged proof results in one epoch (different targets — common when two SNs both hold the artifact), we submit a recheck for the first only and forever mark the (epoch, ticket) "submitted." Second target never gets evidence on chain. This skews chain-side N/R/D math and weakens the cross-checking LEP-6 relies on. Make PK (epoch_id, ticket_id, target_account) and thread target through HasRecheckSubmission / MarkRecheckSubmissionSubmitted / DeletePendingRecheckSubmission and the finder's seen map.
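
Sketch of the suggested shape (table and column names mirror the text above
but are stand-ins for the actual pkg/storage/queries/recheck.go schema):

package queries

import (
    "context"
    "database/sql"
)

const createRecheckSubmissions = `
CREATE TABLE IF NOT EXISTS recheck_submissions (
    epoch_id       INTEGER NOT NULL,
    ticket_id      TEXT    NOT NULL,
    target_account TEXT    NOT NULL,
    status         TEXT    NOT NULL DEFAULT 'pending',
    PRIMARY KEY (epoch_id, ticket_id, target_account)
);`

// hasRecheckSubmission keys on the target as well, so a second challenged
// target under the same (epoch, ticket) is still eligible for evidence.
func hasRecheckSubmission(ctx context.Context, db *sql.DB, epochID int64, ticketID, target string) (bool, error) {
    var n int
    err := db.QueryRowContext(ctx,
        `SELECT COUNT(1) FROM recheck_submissions
          WHERE epoch_id = ? AND ticket_id = ? AND target_account = ?`,
        epochID, ticketID, target).Scan(&n)
    return n > 0, err
}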

C3. Healer drops staging dir + dedup row on any unclassified submit error → silent data loss after committed-but-ack-lost tx
supernode/self_healing/healer.go:87-99

if isChainHealOpInvalidState(err) { … return nil }   // matches ONE substring
_ = s.store.DeletePendingHealClaim(ctx, op.HealOpId)
_ = os.RemoveAll(stagingDir)                         // destructive

isChainHealOpInvalidState only matches the literal string "does not accept healer completion claim". A lost ack on an actually-committed MsgClaimHealComplete (gRPC Unavailable / DeadlineExceeded / context canceled) is indistinguishable from a real reject and we wipe the staging dir. Phase-1 doesn't publish to KAD until VERIFIED, so the only copy of the reconstructed bytes is gone. Chain reaches VERIFIED, finalizer has no claim row → PublishStagedArtefacts is never called → the network never gets the data. On any unclassified submit error, reconcile via GetHealOp (same path as reconcileExistingClaim) before deleting staging.
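
Sketch of the reconcile-before-delete shape (types and status strings here are
hypothetical stand-ins for the self_healing wiring; the point is that staging
is removed only once chain state proves it is no longer needed):

package selfhealing

import (
    "context"
    "os"
)

type healOp struct {
    HealOpID   string
    ResultHash string
    Status     string
}

type chainQuerier interface {
    GetHealOp(ctx context.Context, id string) (healOp, error)
}

// onUnclassifiedSubmitError re-reads the heal-op instead of assuming the
// submit was rejected.
func onUnclassifiedSubmitError(ctx context.Context, q chainQuerier, op healOp, manifestHash, stagingDir string) error {
    onChain, err := q.GetHealOp(ctx, op.HealOpID)
    if err != nil {
        return err // transient query failure: keep staging, retry next tick
    }
    switch {
    case onChain.Status == "HEALER_REPORTED" && onChain.ResultHash == manifestHash:
        return nil // the earlier submit actually committed; caller persists the dedup row
    case onChain.Status == "FAILED" || onChain.Status == "EXPIRED":
        return os.RemoveAll(stagingDir) // terminal on chain: deletion is now safe
    default:
        return nil // still SCHEDULED: keep staging and retry the submit next tick
    }
}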

C4. isChainHealOpNotFound over-matches "not found" substring → wipes live claims on transient query errors
supernode/self_healing/finalizer.go:113-119

return strings.Contains(msg, "not found") || strings.Contains(msg, "not_found")

Any error string containing "not found" (gRPC block N not found, codec lookup miss, key-not-found) triggers cleanupClaim(EXPIRED) → os.RemoveAll(stagingDir). Same data-loss surface as C3, driven by query failures. Match status.Code(err) == codes.NotFound plus a typed audittypes sentinel.

C5. Pre-staged pending row blocks all subsequent retries forever after crash between INSERT and submit
supernode/self_healing/service.go:334, healer.go:77-87, verifier.go:127-137
RecordPendingHealClaim writes status='pending' before the chain submit. If the SN dies between INSERT and a successful ClaimHealComplete, on restart HasHealClaim returns true (any status) and dispatchHealerOps skips this op forever. Chain still says SCHEDULED → finalizer's default branch is no-op. Heal-op silently expires; SN is penalised. Same for verifier between RecordPendingHealVerification and SubmitHealVerification → quorum may fail. Either make HasHealClaim/HasHealVerification count only status='submitted', or have the dispatcher resume pending rows.

C6. GetCompoundProof has no per-call cap on len(Ranges) or aggregate bytes → DoS + bulk-exfil channel
supernode/transport/grpc/storage_challenge/handler.go:283-353
The simple GetSliceProof enforces maxServedSliceBytes = 65_536 (line 23). GetCompoundProof validates only "all ranges same size" + "end ≤ ArtifactSize". No cap on len(req.Ranges), no cap on requestRangeLen, no aggregate byte cap. Spec contract is k=4 × 256B = 1 KiB; an authenticated peer can request 1000 ranges × 100 MB or the whole artifact. Both a DoS vector and a bulk-data-exfiltration path that bypasses the cascade access path. Reject len(Ranges) > MaxRanges (e.g. 16), requestRangeLen > 4 × LEP6CompoundRangeLenBytes, and aggregate ≤ 16 KiB.
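
Sketch of the caps (limits follow the numbers above; constant and field names
are illustrative, not the handler's actual identifiers):

package handler

import "fmt"

const (
    maxCompoundRanges         = 16
    lep6CompoundRangeLenBytes = 256
    maxPerRangeBytes          = 4 * lep6CompoundRangeLenBytes
    maxCompoundAggregateBytes = 16 * 1024
)

type byteRange struct{ Start, End uint64 }

// validateCompoundRanges rejects oversized requests before any artifact
// bytes are read.
func validateCompoundRanges(ranges []byteRange, artifactSize uint64) error {
    if len(ranges) == 0 || len(ranges) > maxCompoundRanges {
        return fmt.Errorf("range count %d outside [1, %d]", len(ranges), maxCompoundRanges)
    }
    var total uint64
    for _, r := range ranges {
        if r.End <= r.Start || r.End > artifactSize {
            return fmt.Errorf("range [%d, %d) out of bounds for artifact size %d", r.Start, r.End, artifactSize)
        }
        length := r.End - r.Start
        if length > maxPerRangeBytes {
            return fmt.Errorf("range length %d exceeds per-range cap %d", length, maxPerRangeBytes)
        }
        total += length
        if total > maxCompoundAggregateBytes {
            return fmt.Errorf("aggregate %d exceeds cap %d bytes", total, maxCompoundAggregateBytes)
        }
    }
    return nil
}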


🟠 HIGH

H1. No deadline-epoch check before ClaimHealComplete / SubmitHealVerification / recheck submit → fee burn + wasted CPU/RAM on guaranteed-rejected txs
supernode/self_healing/healer.go:87, verifier.go:137, recheck/attestor.go:41
HealOp.DeadlineEpochId is carried in the proto but never consulted. RaptorQ reseed plus VerifierFetchAttempts=3 × VerifierFetchTimeout=60s + backoff can take 3+ minutes; with storage_truth_heal_deadline_epochs=2 (system tests) the deadline routinely passes mid-flow. The chain rejects past-deadline submissions, and the rejection error string is not in isChainHealOpInvalidState, so the dispatcher retries every poll until status flips — repeat-burning fees. Fetch current epoch (Audit().GetCurrentEpoch) before each submit; if current >= deadline, cleanup + skip.

H2. Service.listOps requests pagination=nil and never walks next_key → heal-ops silently dropped at scale
supernode/self_healing/service.go:451

resp, err := s.lumera.Audit().GetHealOpsByStatus(queryCtx, status, nil)

Cosmos-SDK default page = 100. Under load (chain holding many SCHEDULED/HEALER_REPORTED ops), the SN that's the assigned healer or verifier for any op past page 1 will simply never see it — silent missed heal/verification, on-chain penalty, missed quorum. Loop on resp.Pagination.NextKey until empty.
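
Sketch of the next_key walk (the querier interface is a stand-in; the real
response carries a standard cosmos-sdk PageResponse):

package selfhealing

import "context"

type pageResponse struct{ NextKey []byte }

type healOpsPage struct {
    Ops        []string
    Pagination *pageResponse
}

type auditQuerier interface {
    GetHealOpsByStatus(ctx context.Context, status string, pageKey []byte) (healOpsPage, error)
}

// listAllOps follows NextKey until the chain reports no further pages.
func listAllOps(ctx context.Context, q auditQuerier, status string) ([]string, error) {
    var all []string
    var key []byte
    for {
        resp, err := q.GetHealOpsByStatus(ctx, status, key)
        if err != nil {
            return nil, err
        }
        all = append(all, resp.Ops...)
        if resp.Pagination == nil || len(resp.Pagination.NextKey) == 0 {
            return all, nil
        }
        key = resp.Pagination.NextKey
    }
}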

H3. Per-target dispatch failure swallowed — no StorageProofResult emitted → permanent gap in chain N/R/D math
supernode/storage_challenge/lep6_dispatch.go:295-299, plus early-return paths at :351, 372, 376, 380, 394, 403, 475, 479
Many recoverable failures (cascade-meta fetch, ordinal selection, artifact-size resolve, transcript hash, sign) just log.Warn(...); return err from dispatchTicket, with no row appended. Chain interprets absent evidence for an assigned (challenger, target, bucket) slot as missing — no other reporter fills it because the slot is taken. Internal failures should fall through to appendFail(...) with INVALID_TRANSCRIPT (or a new INTERNAL_ERROR class); reserve return err for ctx.Err().

H4. appendFail / appendNoEligible discard sign errors via sig, _ := snkeyring.SignBytes(...) → empty/garbage signature attached to result row
supernode/storage_challenge/lep6_dispatch.go:319-323, 526-530

sig, _ := snkeyring.SignBytes(...)

Empty ChallengerSignature on a StorageProofResult will fail chain validation. Per MsgSubmitEpochReport semantics, a single malformed result rejects the entire epoch report → one transient keyring failure poisons the whole epoch's evidence. Propagate the error or skip the entry; never emit a row with empty signature.

H5. Result-buffer >16 throttle drops by ticket_id lex order, not by age or signal-value
supernode/storage_challenge/result_buffer.go:110-123
Comment claims "drop oldest first" but the code sorts by nonRecent[i].TicketId < nonRecent[j].TicketId. Ticket IDs are content-addressed; lex order has no relation to submission order. When >16 results overflow the chain cap, an attacker who can shape ticket IDs (or just lucky lex order) determines what reaches the chain. There's no per-(target, bucket) fairness either — all dropped slots can hit one bucket, biasing chain-side coverage stats. Either rename honestly to "lex-deterministic-drop", or carry a submission timestamp / round-robin across (target, bucket).

H6. SelectArtifactClass cross-class fallback may not match chain's authoritative class → wrong N/R/D delta routing
pkg/storagechallenge/deterministic/lep6.go:423-440
When the rolled class has zero artifacts, the supernode silently swaps to the other class. Per LEP-6 spec §14, "Symbol vs index hash mismatch artifact-class affects D/N deltas — supernode must report the correct class." If chain re-derives or validates class independently and doesn't mirror this fallback, deltas land in the wrong bucket and the trust multiplier (R/100, Class A pre-recheck only) is misapplied. Either pin a chain-anchored test vector for the fallback or emit NO_ELIGIBLE_TICKET instead of swapping classes.

H7. Verifier streaming has no max-bytes guard → buggy/malicious healer can OOM verifier
supernode/self_healing/peer_client.go:105-120
First message advertises TotalSize, but the verifier never compares accumulated length, never enforces a ceiling, never refuses oversized chunks. Read TotalSize, validate ≤ MaxReconstructedBytes, pre-allocate, and abort if len(buf)+len(msg.Chunk) exceeds TotalSize.
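
Sketch of the bounded read (stream and message types are stand-ins for the
generated ServeReconstructedArtefacts client; the 2 GiB ceiling is an assumed
value):

package selfhealing

import (
    "errors"
    "io"
)

const maxReconstructedBytes = 2 << 30 // assumed ceiling

type artefactChunk struct {
    TotalSize uint64
    Chunk     []byte
}

type artefactStream interface {
    Recv() (artefactChunk, error)
}

// readReconstructed validates the advertised size up front and refuses any
// chunk that would push the accumulated length past it.
func readReconstructed(stream artefactStream) ([]byte, error) {
    first, err := stream.Recv()
    if err != nil {
        return nil, err
    }
    if first.TotalSize == 0 || first.TotalSize > maxReconstructedBytes {
        return nil, errors.New("advertised TotalSize outside allowed range")
    }
    buf := make([]byte, 0, first.TotalSize)
    appendChunk := func(chunk []byte) error {
        if uint64(len(buf))+uint64(len(chunk)) > first.TotalSize {
            return errors.New("healer sent more bytes than advertised TotalSize")
        }
        buf = append(buf, chunk...)
        return nil
    }
    if err := appendChunk(first.Chunk); err != nil {
        return nil, err
    }
    for {
        msg, err := stream.Recv()
        if err == io.EOF {
            return buf, nil
        }
        if err != nil {
            return nil, err
        }
        if err := appendChunk(msg.Chunk); err != nil {
            return nil, err
        }
    }
}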

H8. ServeReconstructedArtefacts doesn't check op.Status → bytes served outside §19 healer-served-path window
supernode/transport/grpc/self_healing/handler.go:126-148
Authorises by healer identity + verifier-set membership. Per §19, the fetch is valid only while op.Status == HEALER_REPORTED. After VERIFIED (artefacts published to KAD), FAILED, or EXPIRED, the staging dir may still exist (cleanup is best-effort) and a former assigned verifier can pull bytes that no longer represent canonical state. Mid-stage (still SCHEDULED) it could serve a partial file. Reject with FailedPrecondition unless op.Status == HEAL_OP_STATUS_HEALER_REPORTED.

H9. Chain-error classification is fragile English-substring matching across the board
supernode/self_healing/healer.go:175-181, verifier.go:162-167, recheck/attestor.go:73-78, finalizer.go:113-119
Idempotency / dedup branches all key on strings.Contains(err.Error(), "...") against literal English chain error strings ("verification already submitted by creator", "recheck evidence already submitted", "does not accept healer completion claim", "not found"). Any chain-side error refactor (sdk version bump, error wrap, i18n) silently flips the branch into the destructive delete pending row + RemoveAll(stagingDir) path and re-submits forever. Match on typed errors via errors.Is against exported audit-module sentinels, or sdk error codes.
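
Sketch of the typed classification (sentinel names are hypothetical; the real
ones would be exported by the audit module or a shared chain-errors package):

package chainerrors

import (
    "errors"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// Hypothetical exported sentinels the chain-facing packages would match on.
var (
    ErrHealClaimAlreadySubmitted    = errors.New("heal claim already submitted")
    ErrVerificationAlreadySubmitted = errors.New("heal verification already submitted")
)

func IsAlreadySubmitted(err error) bool {
    return errors.Is(err, ErrHealClaimAlreadySubmitted) ||
        errors.Is(err, ErrVerificationAlreadySubmitted)
}

func IsNotFound(err error) bool {
    return status.Code(err) == codes.NotFound
}

// IsTransient short-circuits destructive cleanup for errors that say nothing
// about chain state (network outage, deadline, cancellation).
func IsTransient(err error) bool {
    switch status.Code(err) {
    case codes.Unavailable, codes.DeadlineExceeded, codes.Canceled:
        return true
    default:
        return false
    }
}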


🟡 MEDIUM

M1. StagingDir default "heal-staging" is a relative path → process-CWD-dependent
supernode/config/defaults.go:31, supernode/cmd/start.go:301-322, supernode/config.yml:59
applyLEP6DefaultsAndValidate always pre-fills with the relative literal so the withDefaults() ~/.supernode/heal-staging fallback never fires. Under systemd WorkingDirectory=/, multi-GB reconstructed artefacts land in /heal-staging. Resolve through Config.GetFullPath(...) before passing to Service.New, or default to filepath.Join(BaseDir, "heal-staging").

M2. Goroutines in tick use the long-lived Run ctx — wedged ops never release semaphore slots
supernode/self_healing/service.go:344-353, 397-414, 434-442
semReconstruct/semVerify slots and inFlight keys leak forever on hung peer fetches or hung RaptorQ. Wrap with context.WithDeadline derived from op.DeadlineEpochId (or a hard ceiling).

M3. historyStore.CloseHistoryDB runs while LEP-6 services are still draining
supernode/cmd/start.go:413-440
After cancel(), in-flight ClaimHealComplete calls don't honor cancellation immediately. They subsequently call MarkHealClaimSubmitted against a closed DB → error → pending row never cleared → next start reconstructs again. Move CloseHistoryDB after <-servicesErr.

M4. ResolveArtifactSize for INDEX class re-runs cascadekit.GenerateIndexFiles per dispatch — perf hit + cross-version determinism risk
pkg/storagechallenge/lep6_resolution.go:139-153
Two reporters on slightly different cascadekit versions compute different sizes → different ComputeMultiRangeOffsets → different derivation_input_hash → chain treats reports as contradictions. Cache per-ticket; pin cascadekit version explicitly; add a chain-anchored test vector.

M5. RecoveryReseed reads entire reconstructed file into RAM via os.ReadFile to copy to staging
supernode/cascade/reseed.go:256-263, :346
With MaxConcurrentReconstructs=2 and large actions, peak RAM = 2 × file_size on top of the RaptorQ working set. OOM at heal time, exactly when the operator can least afford it. Use io.Copy or os.Rename (same FS).

M6. probeTCP reports DNS / route errors as PORT_STATE_CLOSED
supernode/host_reporter/service.go:348-364
Transient EHOSTUNREACH / DNS failure / ctx.Err() becomes permanent on-chain port-closed evidence. Return PORT_STATE_UNKNOWN for those; only CLOSED on explicit ECONNREFUSED.
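
Sketch of the taxonomy (port-state names are stand-ins for the generated enum;
errors.Is against syscall.ECONNREFUSED is the usual way to detect an explicit
refusal on Unix-like systems):

package hostreporter

import (
    "context"
    "errors"
    "net"
    "syscall"
    "time"
)

type portState int

const (
    portStateUnknown portState = iota
    portStateOpen
    portStateClosed
)

// probeTCP maps only an explicit refusal to CLOSED; DNS failures, unreachable
// routes, timeouts, and cancellation all map to UNKNOWN so no permanent
// on-chain evidence is derived from a transient local condition.
func probeTCP(ctx context.Context, addr string, timeout time.Duration) portState {
    d := net.Dialer{Timeout: timeout}
    conn, err := d.DialContext(ctx, "tcp", addr)
    if err == nil {
        conn.Close()
        return portStateOpen
    }
    if errors.Is(err, syscall.ECONNREFUSED) {
        return portStateClosed
    }
    return portStateUnknown
}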

M7. dispatchFinalizer skipped entirely under modeGate → staging dirs leak forever after a mode rollback
supernode/self_healing/service.go:276-283
If governance flips StorageTruthEnforcementMode back to UNSPECIFIED while pending claim rows + staging dirs exist, the finalizer never runs. Run dispatchFinalizer regardless of mode; only gate dispatch (healer/verifier) phases.

M8. SQLite ALTER TABLE … ADD COLUMN status runs unguarded on every startup
pkg/storage/queries/self_healing_lep6.go:85,99 (invoked at sqlite.go:401,409)
On a fresh DB the prior CREATE TABLE already has the column → ALTER returns "duplicate column name" → silently swallowed (_, _ =). The swallow also masks real errors (disk full, locked DB). Guard via PRAGMA table_info lookup; only ALTER if column missing; surface real errors. Same for recheck.go:37.
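
Sketch of the guarded migration (pragma_table_info is available as a
table-valued function in SQLite ≥ 3.16; table and column names mirror the text
above):

package queries

import "database/sql"

// columnExists checks PRAGMA table_info before attempting an ALTER.
func columnExists(db *sql.DB, table, column string) (bool, error) {
    rows, err := db.Query(`SELECT name FROM pragma_table_info(?)`, table)
    if err != nil {
        return false, err
    }
    defer rows.Close()
    for rows.Next() {
        var name string
        if err := rows.Scan(&name); err != nil {
            return false, err
        }
        if name == column {
            return true, nil
        }
    }
    return false, rows.Err()
}

// addStatusColumnIfMissing only ALTERs when the column is absent and returns
// real errors instead of discarding them.
func addStatusColumnIfMissing(db *sql.DB, table string) error {
    exists, err := columnExists(db, table, "status")
    if err != nil || exists {
        return err
    }
    _, err = db.Exec(`ALTER TABLE ` + table + ` ADD COLUMN status TEXT NOT NULL DEFAULT 'pending'`)
    return err
}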

M9. Restart replay: lastRunEpoch is in-memory only
supernode/storage_challenge/service.go:170, 204
On restart the SN re-dispatches the same epoch even if host_reporter already submitted MsgSubmitEpochReport. Persist lastSubmittedEpoch per key in SQLite.

M10. ticket_provider.go requires BOTH IndexArtifactCount AND SymbolArtifactCount non-zero
supernode/storage_challenge/ticket_provider.go:108
lep6_resolution.go:42-46 says "If both counts are zero (legacy) chain accepts"; chain may also accept one-zero. This filter silently makes such tickets invisible to LEP-6 dispatch. Require at least one non-zero.

M11. Recheck() shadow-swaps the dispatcher's main buffer under a long-held lock
supernode/storage_challenge/lep6_recheck.go:42-48
Lock is held across an entire RPC round-trip to peer, serialising all rechecks and dispatches network-wide on a single SN. Pass the buffer as a parameter; don't mutate shared state under a long lock.

M12. sequenceMismatchMaxAttempts = 3 hard-coded, not threaded through TxHelperConfig
pkg/lumera/modules/tx/helper.go:19
Three new high-rate tx surfaces (ClaimHealComplete, SubmitHealVerification, SubmitStorageRecheckEvidence) inherit a non-tunable cap. Operators can't tune under chain congestion. Plumb through TxHelperConfig with the same MaxGasAdjustmentAttemptsCap-style hard ceiling enforced in both applyTxHelperDefaults AND UpdateConfig (per the supernode tx-helper safety-cap-mirroring rule).


🟢 LOW

  • L1. Server.NewServer accepts nil resolver → in tests/silent wiring drift, handler trusts user-supplied req.VerifierAccount. Reject nil outright in production constructor. (transport/grpc/self_healing/handler.go:118-120)
  • L2. negativeAttestationHash mixes raw fetchErr.Error() text → per-verifier-unique placeholder hash. Hash a small canonical reason taxonomy. (self_healing/verifier.go:62-63, 116-119)
  • L3. RecordPendingRecheckSubmission uses INSERT OR IGNORE → silently masks duplicate-attempt scenarios; we then submit anyway and the chain rejects. (pkg/storage/queries/recheck.go:74)
  • L4. Recheck finder returns whole-tick failure on first per-reporter RPC error → one unreachable reporter masks every other reporter's candidates. (recheck/finder.go:70-81)
  • L5. appendNoEligible discards the actually-selected ticket id → loses observability and merges two distinct error modes into the same row shape. (storage_challenge/lep6_dispatch.go:358-361)
  • L6. tests/system/config.lep6-1.yml has storage_challenge.enabled: false but lep6.recheck.enabled: true → dead recheck block; foot-gun for anyone copying the template.

✅ Audited and clean

  • audit_msg/impl.go — uses standard txHelper.ExecuteTransaction for all 5 methods; NewModule legacy + NewModuleWithTxHelperConfig constructors both preserved (public-API rule honored).
  • applyTxHelperDefaults + UpdateConfig both enforce MaxGasAdjustmentAttemptsCap (mirroring rule honored).
  • pkg/metrics/lep6 — labels are bounded enum strings; no per-ticket / per-account labels → no cardinality blow-up.
  • Hash determinism along the healer↔verifier path: both compute cascadekit.ComputeBlake3DataHashB64 over the full reconstructed file; healer derives manifest hash from meta.DataHash after VerifyB64DataHash succeeded against the same recipe.
  • §19 healer-served-path identity check in handler.go correctly prefers secure-RPC identity over req.VerifierAccount for caller resolution (when resolver wired).
  • Empty-verifier-set bypass: not present on supernode side — quorum is chain-side; SN never auto-finalizes.
  • pkg/netutil/hostport.go — IPv6 brackets, zone IDs, malformed inputs handled without panic.
  • supernode/config/save.go — config file written 0600, dir 0700. No secrets logged.

Test-coverage gaps that would have caught the above at CI

  • No test for SelectArtifactClass cross-class fallback against a chain-anchored vector (H6).
  • No test for >16 result-buffer throttle ordering / fairness (H5).
  • No restart-replay scenario for either heal-claim pending-row or storage-challenge lastRunEpoch (C5, M9).
  • No test for multi-target same-ticket recheck dedup (C2).
  • No test for chain-error classification — every "is-already-submitted" / "is-not-found" branch is on a hand-typed substring with no fixture pinning the exact chain error.
  • No test for GetCompoundProof adversarial range payloads (C6).

Verdict

REQUEST CHANGES. Six CRITICAL, nine HIGH. The two highest-impact classes are (1) silent data-loss paths in self-healing driven by ordinary transient errors (C3, C4, C5, H9) and (2) chain-side N/R/D math being silently wrong because of (a) the recheck dedup PK gap (C2), (b) swallowed dispatch errors (H3), (c) malformed-signature row emission (H4), and (d) the cross-class fallback non-determinism (H6). The auto-opt-in default (C1) and unbounded GetCompoundProof (C6) are operator-impact / DoS issues that should not ship as-is.

Most of these are localised fixes — the biggest structural one is replacing English-substring chain-error matching with typed sentinels, which is one diff that closes most of the data-loss surface.

Resolves all 33 items from mateeullahmalik's CHANGES_REQUESTED review on
PR #286. Per Matee's lens — silent data-loss, chain N/R/D math fragility,
operator-impact / fee-burn, DoS / bulk-exfil on authenticated handlers,
English-substring chain-error matching — every finding is closed without
spec divergence and with regression tests.

Highlights by failure class:

1. Typed chain-error sentinels (H9 umbrella; foundation for C3/C4/C5/H1/L3)
   - new pkg/lumera/chainerrors with predicates (errors.Is + gRPC code +
     substring fallback) and transient short-circuit; replaces every
     strings.Contains(err.Error(), …) under self_healing/recheck/
     storage_challenge.

2. Storage layer (C2, M8, M12, L3, L4)
   - recheck dedup migrated to (heal_op_id, target_supernode) PK with typed
     ErrAlreadyExists, INSERT … ON CONFLICT DO NOTHING.
   - PRAGMA-guarded ALTER TABLE migrations; tx-helper sequence cap and
     fairness; finder per-reporter error isolation.

3. Self-healing safety overhaul (C3, C4, C5, H1, H2, H7, M2, M5, M7, L2)
   - reconcile-not-purge for transient errors; pre-submit deadline-epoch
     check; paginated GetHealOpsByStatus; bounded streaming caps; per-op
     deadline goroutines; finalizer runs regardless of mode-gate; reseed
     via os.Rename / io.Copy; canonical negative-attestation reason taxonomy.

4. Storage-challenge dispatch + buffer (C2, H3, H4, H5, H6, M4, M9, M10,
   M11, L5)
   - chain-anchored partial rows for pre-derivation early returns
     (ctx.Err passthrough); sign errors drop the row + metric, never lie;
     arrival-order + (target,bucket) fairness buffer (no lex shaping);
     no class swap when rolled class empty (NO_ELIGIBLE_TICKET only);
     bounded LRU index-size cache; SQLite-persisted lastSubmittedEpoch;
     at-least-one-class-non-zero ticket gate; per-call recheck buffer.

5. Operator config / shutdown / probing (C1, M1, M3, M6, L6)
   - LEP-6 toggles default-FALSE on missing config; startup advisory WARN
     names each disabled service; structural validator rejects recheck=true
     with parents disabled; staging-dir resolved via GetFullPath;
     historyStore.CloseHistoryDB moved after services drain; probeTCP
     taxonomy distinguishes ECONNREFUSED (CLOSED) from DNS / EHOSTUNREACH /
     ctx.Err / Timeout (UNKNOWN); fixtures aligned with new gating chain.

6. Transport handlers (C6, H8, L1)
   - GetCompoundProof: per-call MaxCompoundRanges=16, per-range cap
     4*LEP6CompoundRangeLenBytes, MaxCompoundAggregateBytes=16 KiB;
     rejected before any artifact bytes are read.
   - ServeReconstructedArtefacts: gated on
     op.Status == HealOpStatus_HEAL_OP_STATUS_HEALER_REPORTED.
   - NewServer rejects nil resolveCaller; NewServerForTest is the
     documented test-only escape hatch.

Spec-fidelity: no scoring constants changed; no chain-side semantics altered.
Chain-anchored validator rules cited at chain path:line for every consensus-
affecting branch (PK shape, partial-row class, dispatch class fallback,
deadline sentinel).

Validation: go build, go vet, focused per-wave package tests, and full
go test $(go list ./... | grep -v /tests) -count=1 sweep — zero regressions
across 50+ packages.
@j-rafique j-rafique requested a review from mateeullahmalik May 6, 2026 15:34